Constructing Genome Scale Suffix Trees
نویسنده
چکیده
Suffix trees have been the focus of significant research interest as they permit very efficient solutions to a range of string and sequence searching problems. Given a suffix tree that encodes a particular string, it is possible to solve problems such as searching for a specific pattern in time proportional to the length of the pattern rather than the length of the string. Suffix trees can also support inexact matching by dramatically improving the performance of dynamic programming. Therefore, suffix trees may enable a number of large scale bioinformatics problems to be solved more efficiently than previously thought. However, these benefits presume that a suffix tree of sufficient scale can be constructed. An inherent difficulty in suffix tree construction is that the tree construction requires a semi random walk over the tree as it is constructed. Therefore very large trees that will not fit in memory could take an unacceptably long time to construct due to excessive page faulting. In this paper we present a linear time construction algorithm that can construct suffix trees larger than memory using discrete sub-trees. The sub-trees can be constructed in parallel. The performance of the algorithm is evaluated using suffix trees constructed for chromosomes 1 and 12 of the human genome.
منابع مشابه
Constructing Chromosome Scale Suffix Trees
Suffix trees have been the focus of significant research interest as they permit very efficient solutions to a range of string and sequence searching problems. Given a suffix tree that encodes a particular string, it is possible to solve problems such as searching for a specific pattern in time proportional to the length of the pattern rather than the length of the string. Suffix trees can also...
متن کاملSuffix trees for inputs larger than main memory
A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, when it comes to the use of suffix trees in real-life applications, the current methods for constructing suffix trees do not scale for large inputs. As suffix trees are larger than the input sequences and quickly outgrow the main memory, the first attempts at building large suffix trees focused on algo...
متن کاملLinear-Time Construction of Suffix Arrays
The time complexity of suffix tree construction has been shown to be equivalent to that of sorting: O(n) for a constant-size alphabet or an integer alphabet and O(n logn) for a general alphabet. However, previous algorithms for constructing suffix arrays have the time complexity of O(n logn) even for a constant-size alphabet. In this paper we present a linear-time algorithm to construct suffix ...
متن کاملFaster Suffix Tree Construction with Missing
We consider suffix tree construction for situations with missing suffix links. Two examples of such situations are suffix trees for parameterized strings and suffix trees for two-dimensional arrays. These trees also have the property that the node degrees may be large. We add a new backpropagation component to McCreight’s algorithm and also give a high probability hashing scheme for large degre...
متن کاملRepMaestro: scalable repeat detection on disk-based genome sequences
MOTIVATION We investigate the problem of exact repeat detection on large genomic sequences. Most existing approaches based on suffix trees and suffix arrays (SAs) are limited either to small sequences or those that are memory resident. We introduce RepMaestro, a software that adapts existing in-memory-enhanced SA algorithms to enable them to scale efficiently to large sequences that are disk re...
متن کامل